Introduction

This analysis builds a model for understanding \(\mathbb{P}(\text{success})\).

We first need to load the data from MLflow. This analysis details the steps required to go from loading the data to building various models. We also conduct an EDA on the instance features.

Set up Environment

# Import libraries
library(here)
library(igraph)
library(cowplot)
library(grid)
library(gridExtra)
library(tidyverse)
library(GGally)
library(RColorBrewer)
library(corrplot)

# Set knitr options
knitr::opts_chunk$set(echo = TRUE, warning = F, message = F, error = F, fig.align = 'center', fig.width = 10)

# Source in relevant scripts
source(here("utils/mlflow-utils.R"))
source(here("utils/plotting-utils.R"))
source(here("analysis/enrichments/variable-clause-graph.R"))
source(here("analysis/enrichments/variable-graph.R"))

# Load in enriched data
d_enriched <- read_rds(here("data/d_enriched.rds"))
d_runs <- get_mlflow_data(here("data/d_runs.csv"))
## 2020-05-28 12:27:36 INFO::Data last updated 2020-05-27 23:13:37

Our enriched dataset follows a naming convention based on the prefix of each column; we describe the numeric columns below:
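As a small sketch of how this convention can be used, the columns can be tallied by feature family directly from their prefixes (the column names here are examples taken from this analysis):

```r
# Column names follow the prefix convention used throughout this analysis
cols <- c("metrics_p_success", "metrics_min_energy_gap",
          "params_n_qubits", "f_vcg_mean_var", "f_vg_max")

# Tally how many columns fall into each feature family
families <- c("metrics_", "params_", "f_vcg_", "f_vg_")
family_counts <- sapply(families, function(p) sum(startsWith(cols, p)))
family_counts
```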

Variable Clause Graph

The Variable-Clause Graph (VCG) is a bipartite graph with a node for each variable, a node for each clause, and an edge between a variable and a clause whenever the variable occurs in that clause.
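As a minimal base-R sketch of the construction (independent of the sourced `make_variable_clause_graph` helper, whose internals are not shown here), the bipartite edge list is just one (variable, clause) pair per occurrence:

```r
# Toy clause list: each element names a clause and lists its variables
clauses <- list(
  k_1 = c(1L, 3L, 4L),
  k_2 = c(1L, 5L, 6L)
)

# One VCG edge per occurrence of a variable in a clause
vcg_edges <- do.call(rbind, lapply(names(clauses), function(k) {
  data.frame(variable = clauses[[k]], clause = k)
}))

nrow(vcg_edges)  # 6 edges: three literals in each of the two clauses
```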

example_d_clauses = list(
  k_1 = c(1L, 3L, 4L),
  k_2 = c(1L, 5L, 6L),
  k_3 = c(1L, 4L, 6L),
  k_4 = c(2L, 5L, 6L),
  k_5 = c(1L, 2L, 5L)
  )

colrs = c("tomato", "gold")
example_qubits = 6
example_var_clause_graph = make_variable_clause_graph(d_clauses = example_d_clauses, example_qubits, ret.all = T)


ggnet2(example_var_clause_graph$vcg, color = "color", palette = "Set2") +
    scale_color_brewer(
      "", 
      palette = "Set1",
      labels = c("tomato" = "Variable", "gold" = "Clause"),
      guide = guide_legend(override.aes = list(size = 6))
    )

Variable Graph

The Variable Graph (VG) has a node for each variable and an edge between two variables whenever they occur together in at least one clause.
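A minimal base-R sketch of the same idea (independent of the sourced `make_variable_graph` helper): connect every pair of variables within each clause, then deduplicate across clauses:

```r
clauses <- list(
  k_1 = c(1L, 3L, 4L),
  k_2 = c(1L, 5L, 6L)
)

# All within-clause variable pairs, deduplicated across clauses
vg_edges <- unique(do.call(rbind, lapply(clauses, function(vars) {
  t(combn(sort(vars), 2))  # each column of combn() is one pair
})))

nrow(vg_edges)  # 6 unique edges
```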

example_d_clauses = list(
  k_1 = c(1L, 3L, 4L),
  k_2 = c(1L, 5L, 6L),
  k_3 = c(1L, 4L, 6L),
  k_4 = c(2L, 5L, 6L),
  k_5 = c(1L, 2L, 5L)
)

example_qubits = 6

var_graph = make_variable_graph(example_d_clauses, example_qubits)
ggnet2(var_graph)

Normalise Data

Before modelling we need to ensure that each configuration of parameters has the same number of randomly generated samples. We will keep only configurations with at least \(n = 10\) completed runs, and add a boolean column, load, indicating whether an instance should be used in our analysis.
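The balancing logic can be sketched on a toy table in base R (hypothetical data; the actual pipeline below operates on the enriched runs): keep only configurations with at least `min_runs` rows, then sample exactly `min_runs` from each.

```r
min_runs <- 2

# Toy run table: config "A" has three completed runs, "B" has only one
runs <- data.frame(config = c("A", "A", "A", "B"), value = 1:4)

# Keep configurations with enough completed runs
counts <- table(runs$config)
keep <- names(counts)[counts >= min_runs]

set.seed(1)
# Draw exactly min_runs rows from each surviving configuration
balanced <- do.call(rbind, lapply(keep, function(cf) {
  rows <- runs[runs$config == cf, ]
  rows[sample(nrow(rows), min_runs), ]
}))
```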

min_runs <- 10

# Find run configs with more than n=min_runs
d_configs <- d_enriched %>% 
  select_if(is.numeric) %>% 
  select(starts_with("params")) %>% 
  count(params_n_qubits, params_time_t, params_t_step) %>% 
  arrange(n) %>% 
  filter(n >= min_runs) %>% 
  mutate(load = T)

d_sampled <- d_enriched %>% 
  left_join(d_configs, by = c("params_n_qubits", "params_time_t", "params_t_step")) %>% 
  filter(load == T) %>% 
  group_by(params_n_qubits, params_time_t, params_t_step) %>% 
  sample_n(min_runs) %>% 
  select_if(is.numeric) %>% 
  select(-experiment_id, -n) %>% 
  ungroup()

Experiments Run

Before we start building a model it’s useful to have a look at what experiments have been run so far.

d_runs %>% 
  count(params_n_qubits, params_t_step, params_time_t) %>% 
  arrange(n) %>% 
  ggplot(aes(x = as.factor(params_n_qubits), y = n, fill = as.factor(params_time_t))) + 
  geom_col(aes(fill = as.factor(params_time_t)), position = position_dodge(width = 0.9)) +
  geom_text(aes(label = n), position = position_dodge(width = 0.9), vjust = -0.7, size = 2.6) + 
  scale_fill_brewer(palette = "Blues", name = "Evolution Time") +
  theme_light() +
  facet_wrap(~params_t_step) + 
  theme(
    legend.position = "bottom"
  ) + 
  labs(
    x = "n qubits",
    y = "Number of runs",
    title = "Summary of Experiments Run"
  )

Great, now when we sample our dataset, each configuration has the same number of completed runs, in our case 10.

d_sampled %>% 
  count(params_n_qubits, params_t_step, params_time_t) %>% 
  arrange(n) %>% 
  ggplot(aes(x = as.factor(params_n_qubits), y = n, fill = as.factor(params_time_t))) + 
  geom_col(aes(fill = as.factor(params_time_t)), position = position_dodge(width = 0.9)) +
  geom_text(aes(label = n), position = position_dodge(width = 0.9), vjust = -0.7, size = 2.6) + 
  scale_fill_brewer(palette = "Blues", name = "Evolution Time") +
  theme_light() +
  facet_wrap(~params_t_step) + 
  theme(
    legend.position = "bottom"
  ) + 
  labs(
    x = "n qubits",
    y = "Number of runs",
    title = "Summary of Sampled Runs"
  )

Energy and Entanglement vs Number of Qubits

d_sampled %>% 
  group_by(params_n_qubits) %>% 
  summarise(
    entanglement = mean(metrics_max_shannon_entropy, na.rm = T),
    min_energy = mean(metrics_min_energy_gap, na.rm = T)
  ) %>% 
  gather(metric, value, -params_n_qubits) %>% 
  ggplot(aes(x = params_n_qubits, y = value, group = 1)) + 
  geom_line() + 
  facet_wrap(~metric, scales = "free", ncol = 1) + 
  theme_light() +
  labs(
    x = "N Qubits",
    y = "Value"
  )

Shannon Entropy vs Clause-to-variable ratio (by qubits)

We notice that for a smaller number of qubits there is almost no variance in the generated instances (as seen in the clause-to-variable ratio). We also see that the maximum Shannon entropy decreases the longer we run the evolution.

d_sampled %>% 
  select(
    metrics_clause_var_ratio, 
    metrics_min_energy_gap, 
    metrics_max_shannon_entropy, 
    params_n_qubits, 
    params_time_t
  ) %>% 
  mutate_all(as.numeric) %>% 
  mutate(
    params_n_qubits = as.factor(params_n_qubits)
    ) %>% 
  ggplot(aes(x = metrics_clause_var_ratio, y = metrics_max_shannon_entropy, col = params_time_t)) +
  geom_point(alpha = 0.8) + 
  facet_wrap(~params_n_qubits) + 
  theme_light() + 
  labs(
    x = "Clause-Variable Ratio",
    y = "Max Shannon Entropy",
    title = "Shannon Entropy vs Clause-Var Ratio (by qubits)"
  )

\(\mathbb{P}(\text{success})\) over T

One of the key metrics we’re interested in is \(\mathbb{P}(\text{success})\). As the evolution run time of the algorithm increases, we observe a higher \(\mathbb{P}(\text{success})\), as well as a higher variance.

d_sampled %>% 
  select(metrics_p_success, params_n_qubits, params_time_t, params_t_step) %>% 
  mutate(
    params_time_t = as.numeric(params_time_t),
    params_t_step = as.factor(params_t_step)
  ) %>% 
  ggplot(aes(x = params_time_t, y = metrics_p_success, col = params_t_step)) + 
  geom_point(alpha = 0.5) + 
  facet_wrap(~as.numeric(params_n_qubits)) +
  labs(
    y = "Probability of Success",
    x = "Evolution Time"
  ) + 
  theme_light()

Shannon Entropy vs \(\mathbb{P}(\text{success})\) (by T)

Here we clearly observe that a lower Shannon entropy corresponds to a higher probability of success, though this may be confounded by the algorithm run time, \(T\).

d_sampled %>% 
  select(metrics_p_success, metrics_max_shannon_entropy, params_time_t) %>% 
  ggplot(aes(x = metrics_p_success, y = metrics_max_shannon_entropy, col = params_time_t)) + 
  geom_point(alpha = 0.5) +
  labs(
    x = "Probability of Success",
    y = "Shannon Entropy"
  ) +
  theme_light()

d_sampled %>% 
  select(metrics_p_success, metrics_max_shannon_entropy, params_time_t, params_n_qubits) %>% 
  ggplot(aes(x = metrics_p_success, y = metrics_max_shannon_entropy, col = params_time_t)) + 
  geom_point(alpha = 0.6) +
  facet_wrap(~params_n_qubits) + 
  labs(
    x = "Probability of Success",
    y = "Shannon Entropy",
    title = "P(success) vs Entropy (by Qubits)"
  ) +
  theme_light()

Min Energy Gap vs \(\mathbb{P}(\text{success})\)

d_sampled %>% 
  select(metrics_p_success, metrics_min_energy_gap, params_time_t) %>% 
  ggplot(aes(x = metrics_p_success, y = metrics_min_energy_gap, col = params_time_t)) + 
  geom_point(alpha = 0.6) +
  theme_light() +
  labs(
    x = "Probability of Success",
    y = "Min Energy Gap",
    title = "P(success) vs Min Energy Gap"
  )

d_sampled %>% 
  select(metrics_p_success, metrics_min_energy_gap, params_time_t, params_n_qubits) %>% 
  ggplot(aes(x = metrics_p_success, y = metrics_min_energy_gap, col = params_time_t)) + 
  geom_point(alpha = 0.8) +
  facet_wrap(~params_n_qubits) + 
  theme_light() + 
  labs(
    x = "Probability of Success",
    y = "Min Energy Gap",
    title = "P(Success) vs Energy Gap (by Qubit)"
  )

Clause to Var Ratio vs \(\mathbb{P}(\text{success})\)

d_sampled %>% 
  select(metrics_p_success, metrics_clause_var_ratio, params_n_qubits, params_time_t) %>% 
  ggplot(aes(x = metrics_clause_var_ratio, y = metrics_p_success, col = params_time_t)) + 
  geom_point() + 
  facet_wrap(~params_n_qubits) +
  theme_light() + 
  labs(
    x = "Clause-Variable Ratio",
    y = "P(Success)"
  )

Features - Problems Size

Problem Size Features w.r.t \(\mathbb{P}(\text{success})\)

Problem Size Features w.r.t Shannon Entropy

Problem Size Features w.r.t Min Energy Gap

Features - Variable Clause Graph

Variable Clause Graph Features w.r.t \(\mathbb{P}(\text{success})\)

Variable Clause Graph Features w.r.t Shannon Entropy

Variable Clause Graph Features w.r.t Min Energy Gap

Features - Variable Graph

Variable Graph Features w.r.t \(\mathbb{P}(\text{success})\)

Variable Graph Features w.r.t Shannon Entropy

Variable Graph Features w.r.t Min Energy Gap

Modelling \(\mathbb{P}(\text{success})\)

We can now begin to model \(\mathbb{P}(\text{success})\). Let’s start with a simple linear model.

First we need to remove columns with zero standard deviation, as well as those with missing values.
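The zero-variance drop below relies on a base-R idiom worth noting: `Filter()` coerces its predicate’s result to logical, and `sd()` of a constant column is 0 (i.e. `FALSE`), so `Filter(f = sd, df)` drops exactly the constant columns. A small sketch on toy data:

```r
df <- data.frame(a = c(1, 2, 3), b = c(5, 5, 5), c = c(0.1, 0.2, 0.4))

# sd() returns 0 for the constant column b, which Filter() treats as FALSE
df_varying <- Filter(f = sd, df)

names(df_varying)  # "a" "c"
```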

We can then examine a correlation matrix.

d_model <- d_sampled %>% 
  select_if(is.numeric) %>% 
  select(-starts_with(c("metric", "params")), metrics_p_success) %>% 
  filter_all(all_vars(!is.na(.))) %>% 
  Filter(f = sd, .)

M <- cor(d_model)
corrplot(
  M, 
  type="upper", 
  order="original",
  col=brewer.pal(n=8, name="RdYlBu"),
  tl.cex = 0.5,
  tl.col = "blue"
  )

Let’s look at the model summary:

d_model %>% 
  lm(data = ., formula = metrics_p_success ~ .) %>% 
  summary()
## 
## Call:
## lm(formula = metrics_p_success ~ ., data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45134 -0.16125 -0.02477  0.18398  0.53792 
## 
## Coefficients: (3 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)   
## (Intercept)             -1.240e+06  1.022e+06  -1.213  0.22564   
## f_p_size_n_clauses      -3.616e-01  1.995e-01  -1.812  0.07056 . 
## f_p_size_n_variables     1.993e-01  1.599e-01   1.247  0.21305   
## f_p_size_ratio           1.173e+06  9.742e+05   1.204  0.22924   
## f_p_size_ratio_sq       -5.909e+05  4.945e+05  -1.195  0.23263   
## f_p_size_ratio_cub       1.239e+05  1.044e+05   1.187  0.23578   
## f_p_size_ratio_recp      7.357e+05  6.015e+05   1.223  0.22188   
## f_p_size_ratio_recp_sq  -2.325e+05  1.885e+05  -1.233  0.21800   
## f_p_size_ratio_recp_cub  3.055e+04  2.455e+04   1.244  0.21404   
## f_p_size_lin_ratio              NA         NA      NA       NA   
## f_p_size_lin_ratio_sq           NA         NA      NA       NA   
## f_p_size_lin_ratio_cb           NA         NA      NA       NA   
## f_vcg_mean_var           1.318e+00  6.503e-01   2.027  0.04318 * 
## f_vcg_median_var         5.799e-02  3.404e-02   1.703  0.08916 . 
## f_vcg_min_var            2.720e-01  1.917e-01   1.419  0.15663   
## f_vcg_max_var           -3.946e-02  2.613e-02  -1.510  0.13162   
## f_vg_mean               -7.060e-02  8.772e-02  -0.805  0.42131   
## f_vg_median              1.965e-02  3.672e-02   0.535  0.59289   
## f_vg_min                -2.946e-01  1.828e-01  -1.612  0.10768   
## f_vg_max                 7.702e-02  2.434e-02   3.164  0.00166 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2308 on 473 degrees of freedom
## Multiple R-squared:  0.2269, Adjusted R-squared:  0.2007 
## F-statistic: 8.674 on 16 and 473 DF,  p-value: < 2.2e-16